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Sufficient dimension reduction methods often require stringent 
conditions on the joint distribution of the predictor, or, when such 
conditions are not satisfied, rely on marginal transformation or 
reweighting to fulfill them approximately. For example, a typical di- 
mension reduction method would require the predictor to have ellip- 
tical or even multivariate normal distribution. In this paper, we re- 
formulate the commonly used dimension reduction methods, via the 
notion of "central solution space," so as to circumvent the require- 
ments of such strong assumptions, while at the same time preserve the 
desirable properties of the classical methods, such as ^Fi-consistency 
and asymptotic normality. Imposing elliptical distributions or even 
stronger assumptions on predictors is often considered as the neces- 
sary tradeoff for overcoming the "curse of dimensionality," but the 
development of this paper shows that this need not be the case. The 
new methods will be compared with existing methods by simulation 
and applied to a data set. 

1. Introduction. Dimension reduction for regression [Li (1991, 1992), 
Cook and Weisberg (1991), Cook (1994, 1996)] is aimed at finding a lower 
dimensional vector of linear combinations of the predictors, which retains as 
much as possible the information in the relationship between the response 
and the original predictors. Let X be a p-dimensional random vector rep- 
resenting the predictor, and let Y be a random variable representing the 
response. If there is a p x q (q < p) matrix (3 such that Y and X are in- 
dependent conditioning on j3 T X (henceforth written as Y _U_ X\f3 T X), then 
the column space of (3 is called a dimension reduction space. Under very mild 
conditions, such as given in Cook (1998), Chiaromonte and Cook (2001) and 
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recently further relaxed by Yin, Li and Cook (2008), the intersection of all 
such spaces is itself a dimension reduction space. In this case we call the 
intersection the Central Space and denote it by Sy\x [Cook (1994, 1996)]. 
A basic problem of dimension reduction is to estimate and make statistical 
inference about Sy\x- 

Commonly used dimension reduction methods, such as those based on in- 
verse conditional moments, require rather strong conditions on the joint dis- 
tribution of X. For example, first-moment-based methods such as Sliced In- 
verse Regression [Li (1991)] and Ordinary Least Squares [Li and Duan (1989)] 
require the linear conditional mean assumption. That is, E(X\(3 T X) is a 
linear function of X. Second- moment-based methods such as Sliced Av- 
erage Variance Estimator [Cook and Weisberg (1991)], Principal Hessian 
Directions [Li (1992)] and Directional Regression [Li and Wang (2007)] re- 
quire, in addition, the constant conditional variance assumption. That is, 
V&r(X\f3 T X) is a nonrandom matrix. Since (3 is unknown, these conditions 
are assumed to hold for all possible (3. If the first condition holds for all (3, 
then X has an elliptically-contoured distribution [Eaton (1986)], if both con- 
ditions hold for all (3, then X has a multivariate normal distribution. Thus, 
in effect, either elliptically-contoured or multivariate normal distribution has 
to be assumed when applying these methods. 

If the actual predictors do not satisfy these conditions, current practice 
often relies on transformation — that is, transform the p components of X , 
(X\, . . . , X p ), to (h\(Xi), . . . , hp(Xp)) by some functions h±, . . . ,h p , so that 
the scatter plot matrix of the transformed predictors resembles that of a 
multivariate normal distribution. While transformation is a pragmatic — and 
often effective — strategy, it has both theoretical and practical difficulties. 
Theoretically, such transformations are intrinsically marginal. It targets the 
marginal distributions of X±, . . . ,X p , and as such does not guarantee that 
E(X\(3 T X) has desired linearity when (3 T X is not a set of AVs. Indeed, there 
can be hidden nonlinearity among the predictors even if their scatter plot 
matrix looks perfectly linear. On the other hand, marginal transformations 
may also be excessive: that E(X\(3 T X) is linear in X does not require every 
component of X to be linear against every other component. Practically, 
whether a transformation has succeeded in transforming a set of observed 
predictors to an elliptical shape often relies entirely on subjective judgement. 
Moreover, transforming a high dimensional predictor may be tedious or even 
infeasible. Another way of dealing with nonellipticity is reweighting [Cook 
and Nachtsheim (1994)]. However, like transformation, it is not focused on 
that part of the nonlinearity in the predictors that is relevant to dimension 
reduction. It is also computationally intensive, especially if the dimension p 
is high. 

When the linear conditional mean and/or constant conditional variance 
assumptions are satisfied, however, the above-mentioned methods share prop- 
erties that make them uniquely desirable among nonparametric methods. 
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First, the slicing (or smoothing) involved in these estimators is over the re- 
sponse Y, which is always one-dimensional, regardless of the dimension of 
X . It is well known that smoothing over a high dimensional vector space is 
undesirable, because the data points within a slice (or a region covered by 
a smoothing kernel) become sparse at an exponential rate as the dimension 
increases — a phenomenon often referred to as the "curse of dimensional- 
ity" [Bellman (1961)]. Second, the size of the slice (or bandwidth of the 
kernel) for the above methods need not decrease with the sample size for 
consistency. These properties make the above methods resemble paramet- 
ric estimators — they are -^/n-consistent regardless of the dimension of X and 
have simple asymptotic structure — even though the problems they tackle are 
in fact nonpar ametric, in the sense that virtually no assumption is imposed 
on the conditional distribution of Y\X. 

In this paper, we introduce a method that does not require linear condi- 
tional mean or constant conditional variance, while at the same time pre- 
serves all the desirable properties described in the foregoing paragraph. Al- 
though the basic idea can potentially apply to methods based on first and 
second inverse conditional moments (such as SIR, the Sliced Average Vari- 
ance Estimator and Directional Regression), here we will focus on first in- 
verse conditional moments. The new method is akin to inverse regression, 
but it is adapted in an automated fashion to the nonlinearity in the predic- 
tors, and only that part of the nonlinearity relevant for dimension reduction. 

In Section 2, we introduce the key idea of Central Solution Space, a 
construction that circumvents the linear conditional mean assumption. We 
study its relation with the Inverse Regression Space and the Central Space. 
In Section 3 we give a general formulation of inverse regression, which ac- 
commodates in a simple form five different dimension reduction methods 
in the literature and in doing so provides a platform on which to general- 
ize them to the nonelliptical situations. This generalization is then carried 
out in Section 4. Section 5 is devoted to issues involved in implementation, 
such as parameterization and optimization. The asymptotic distribution of 
the estimator is developed in Section 6. Sections 7 and 8 are concerned with 
simulation comparison and application. Finally, the proofs of the asymptotic 
results are given in the Appendix. 

2. Central solution space: The principle. The best way to explain the 
central idea of this paper is to explain it in comparison with Sliced Inverse 
Regression. Assume, without loss of generality, that E(X) = and E(Y) = 
0. Let £ be the covariance matrix of X, assumed to be positive definite. 
Suppose the Central Space has dimension d, and let be a p x d matrix 
whose columns form a basis in Sy\x- Sliced Inverse Regression is based on 
the following fact. If 

(1) E(X\f3 T X) is linear in X (linear conditional mean), 
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then the random vector £ 1 E(X\Y) belongs to Sy\x almost surely. To 
see this, let -P(X) be the projection on to Sy\x with respect to the in- 
ner product (a, b) = a T T,b (this will be called the S-inner product); that 
is, P(S) = (5{l3 T T l f3)- 1 (5 T i:. Condition (1) implies E(X\f3 T X) = P T {Y>)X. 
Hence, 

T,- 1 E{X\Y) = TT 1 E[E{X\P T X,Y)\Y] = TT 1 E[E{X\j3 T X)\Y] 

(2) 

= T 1 ~ 1 P T {T)E{X\Y) = P(S)S~ 1 E(X|y). 

Thus the random vector £ -1 i?(X|Y) belongs to the range of the projection 
operator P(S), which is Sy\x- Consequently, the column space of the matrix 

(3) S~ 1 cov[E(X|y)]S~ 1 

is a subspace of Sy\x- This column space will be called the Inverse Regression 
space, and written as 5ir,; the matrix (3) will be written as j4ir. 

At the first sight, linear conditional mean seems crucial in the foregoing 
argument. However, note that it is the second equality in (2) that reflects 
the conditional independence Y _LL X\0 I X, and it requires virtually no con- 
dition. The next two equalities in (2), which require linear conditional mean, 
merely serve to make S" 1 E(X\Y) an explicit vector in Sy\x- This leads us 
to pay special attention to the equation 

(4) E{X\Y) = E[E(X\(f X)\Y] a.s. 

That is, the inverse (L2-) regression of X on Y is the same as the double 
(L2-) regressions of X on (3 T X and then on Y. Because of the importance 
of this equation, we will call it the Inverse Regression Equation. Note that 
if [3 solves this equation, then so does (3 A for any d x d nonsingular matrix 
A. That is, the above equation is identified only up to the column space of 
P. 

Definition 2.1. If /3 is a matrix of p rows that satisfies the inverse 
regression equation (4), then span(/3) is called a solution space of inverse 
regression equation. 

It is easy to see that if f3\ satisfies (4) and is another matrix such that 
span(/3i) C span(/?2), then also satisfies (4). For maximum dimension 
reduction we would like to seek j3 of lowest rank. This leads to the notion 
of Central Solution Space. 

Definition 2.2. If the intersection of any two solution spaces of (4) is 
itself a solution space of (4), then the intersection of all such spaces will 
be called the Central Solution Space of the inverse regression equation and 
written as Scss- 



DIMENSION REDUCTION 



5 



By construction, if 77 is a matrix of dimension p x d\ with d\ greater than 
the dimension of 5css ; anc ^ ^ ^ solves equation (4), then span(?7) contains 

Central Solution Space is defined under the premise that the intersection 
of two solution spaces of (4) is again a solution space of (4). The similar 
premise also underlies the construction of the Central Space, which was 
recently proved under very weak assumptions by Yin, Li and Cook (2008) 
in that context. The proof in our context is similar and is omitted. 

The next proposition reveals the relation among 5csS) <?ir and Sy\x, 
which is the theoretical foundation of our method. We will say that condi- 
tion (1) holds for a subspace S of W if it holds for a matrix 77 whose columns 
form a basis in S. Henceforth, Pn(£) will denote the orthogonal projection 
on to span(r?) with respect to the S-inner product. 

Theorem 2.1. Suppose that Y and the elements of X are square inte- 
grable and E(X) =0. Then: 

1- <?css — &y\x; 

2. //, in addition, condition (1) holds for both 5css and 5t R , then 5i R = 
Scss- 

Proof. 1. Let (3 be p-row matrix such that span(/3) = Sy\x- Then, Y _L 
_L X\[3 T X, which, by (2), implies (4). Thus, S Y \x ls a solution space of (4), 
and assertion 1 follows. 

2. Let 77 be a p-xow matrix whose columns form a basis in 5css • If condition 
(1) holds for 77, then 

E(X\Y) = E[E(X\ V T X)\Y] = P?{Y)E{X\Y) = SP^^S" 1 E(X\Y). 

Hence, E^E^Y) = P^)^ 1 E(X\Y), and consequently 

(5) S-^ar^^ly)]!]^ 1 =P r ,(S)S~ 1 Var[E(X|y)]S- 1 P^(S). 

Thus, we have 5fr C Sqss- 

Conversely, let £ be a p-row matrix whose columns form a basis in tSi R . 
Then, 

E\\^,- 1 E{X\Y) - Pi.{Y>)Y>- l E(X\Y)\\ 2 

= tr(A m ) - tr[A IR if (£)] - tr[P 5 (S)A IR ] +tr[P 5 (S),4i R P c T (S)], 

where ^4t R is as defined in (3). Because span(^4i R ) = span(^) and because 
Am is symmetric, the last three terms on the right (without sign) all reduce 
to tr(Ai R ). Consequently, the above quantity is 0, implying 

YT l E(X\Y) = Pt.(Y>)Y,~ l E(X\Y) a.s. 
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Because E(X\^ T X) is linear in £, the right-hand side is 

Pi.{Y>)Y,- l E{X\Y) = S- 1 P e T (S)E(X|y) = T,- 1 E[E{X\fX)\Y]. 

□ 

Hence, 5 C ss C Sir. 

Observe that part 1 of the theorem holds without any assumption except 
the existence of moments; the linearity assumption is required only when 
5ir enters the picture. Thus, if we target Scss instead of Sir, then we can 
avoid the linearity assumption. 

3. A general formulation of inverse regression. Several important di- 
mension reduction methods are directly or indirectly related to the funda- 
mental fact that Sir C Sy\x under condition (1). These include Ordinary 
Least Squares (OLS) [Li and Duan (1989)], Sliced Inverse Regression (SIR) 
[Li (1992)], Parametric Inverse Regression (PIR) [Bura and Cook (2001)], 
Canonical Correlation [Fung et al. (2002)] and Kernel Inverse Regression 
(KIR) [Zhu and Fang (1996), Ferre and Yao (2005)]. All these methods 
rely on the condition (1) for their consistency. The original form of PIR of 
Bura and Cook (2001) was introduced under the assumption that an inverse 
parametric regression model is true and under that assumption no restric- 
tion needs to be imposed on X. However, PIR is in fact consistent when 
the parametric inverse model is not true, and, in this case, condition (1) is 
needed for its consistency. This fact is noted in Fung et al. (2002) in a dif- 
ferent context. The goal of this paper is to use the general mechanism of the 
Central Solution Space to extend these methods so that their consistency 
does not rely on condition (1). For this purpose, we now give a brief outline 
of the construction of these estimators and synthesize them into a common 
form. 

In the literature, the following estimators are typically described in terms 
of the standardized predictor. But for our purpose it is easier to describe 
them in terms of the original predictor (assuming EX = 0). This makes no 
difference at the population level (though it does make a difference at the 
sample level, where our experience indicates that it is often better to work 
with standardized predictor). 

The OLS estimator is based on the following matrix: 

Aols = ?,- l E{YX)E(YX T )Y>~ 1 . 

Let { Ji, . . . , Jfc} be a (measurable) partition of fty, the sample space of Y, 
and define the discretized version of Y as 

k 

5(Y) = J2mY£Je). 
i=i 
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The SIR estimator is based on the following matrix: 
A sm = Z~ 1 Var[E(X\5(Y))}Z-\ 
Let ip : R + — ► R + be a probability density function h > and y E fiy . Let 

(6) = ^(/T 1 ^ - yD/^fV^ 1 !^ 

Because /i will be treated as fixed throughout the theoretical development, 
we suppress the dependence on h from the notation. Let Y be a random vari- 
able having the same distribution as Y with Y _LL (X, Y). The KIR estimator 
is based on the following matrix: 

(7) A Km = e- 1 e{e[Xk(y,y)\y]e[]Fk(y,y)\y]}v- 1 . 

Finally, let h\, . . . , h s be square integrable functions from Qy to R, one of 
which (say hi) must be taken to be 1 if Y is not centered. Let H(y) = 
(h 1 (y),...,h s (y)) T . Let 

(8) p(y, y) = H T (y)E[H (Y)H T (Y)]^ 1 H (y) . 
The matrix 

Ap m = ^- 1 E{E[Xp(Y,Y)\Y}E[X T p(Y,Y)\Y}}^- 1 

is sufficiently general to accommodate (the population versions of) both PIR 
and Canonical Correlation estimator, though their original forms were quite 
different. We also note that both estimators allow Y to be a vector, but this 
is not considered in this paper. 

It turns that out all four matrices can be written in the same form, which 
will greatly simplify the subsequent development and provide insights into 
the relationship among these methods. Henceforth, for two random elements 

U and V, U = V means that they have the same distribution. 

Theorem 3.1. The matrices ^4ols> ^SIR; ^-kir, ^4pir can be written 
in the following form: 

(9) Tr 1 E{E[Xg{Y,Y)\Y]E[X' 1 ' g{Y,Y)\Y]}^\ 
where g : $7y x $7y — > R, Y JL (X,Y), andY = Y. 

Proof. That Akir and Apir have the form (9) follows from their defi- 
nitions. Also, if we let g(y,y) = y, then 

E(XY) = E(XY\Y) = E[Xg(Y, Y)\Y\. 

Thus, Aqls conforms to (9). 
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For Asm, note that, for any j G {1, . . . , k}, 

F\XWY\ = 1 = = E[XI{5{Y) = 5(Y))\5{Y)=j] 

[ 11 ) Jl ~ P(S(Y)=j) P(5(Y) = 6(Y)\6(Y)=j) ' 

Because Y = Y, the above equality implies that 

25 WCni = E[XI(S(Y) = 5(Y))\8(Y)]/P[5(Y) = 5(Y)\5(Y)]. 
Let g{Y,Y) = I(5{Y) = 5{Y))/P[5{Y) = 5(Y)\5(Y)}. Then, 

(10) E[X\S(Y)]^E[Xg(Y,Y)\5(Y)]. 
In the meantime, 

71(1,7)^(7,^)1(1,7) 

=► (X,Y) ALY\5(Y) => (X,Y)JLY\{6{Y),6(Y)}, 

which, together with S(Y) JL 7|<5(7), implies that 

(X,Y,8(Y))ALY\5(Y). 

See, for example, Dawid (1979) and Cook (1998), Proposition 4.6. Hence, the 
right-hand side of (10) reduces to E[Xg(Y, 7)|7], and equality (10) reduces 
to 

E[X\5(Y)]^E[Xg(Y,Y)\Y]. 
Thus, Asm also has the form (9). □ 

4. Extension to nonlinear predictor cases. The synthesis of the last sec- 
tion provides us a platform on which to extend the five methods to situations 
where the linear conditional mean condition (1) does not hold. We now carry 
out this extension. 

4.1. Central solution spaces. While Theorem 2.1 lays out the basic prin- 
ciple of Central Solution Space, as we have noticed in Section 3, various 
versions of inverse regressions do not take the exact form Var[_E(X|7)]. We 
now extend Theorem 2.1 to accommodate the various forms of inverse re- 
gressions, as synthesized in Section 3. 

Denote the matrix (9) by Ai&(g) and its column space by Sm(g), where 
g stands for the function g(Y,Y) in Theorem 3.1. Consider the following 
equation: 

(11) E[Xg(Y,Y)\Y] = E[E(X\f3 T X)g(Y,Y)\Y], 

where, recall that 7 = 7 and 7 JL (X, 7). Let Scss(<?) be the Central So- 
lution Space of this equation. 
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Theorem 4.1. Suppose that g:£ly x is a measurable function 

such that the elements of Xg(Y,Y) are square integrable. Suppose Y and the 
elements of X are square integrable with E{X) =0 and E(Y) =0. Then: 

1- <Scss(flO Q S Y \x; 

2. If, in addition, condition (1) holds for both 5css(fi') an d SiR,(9)> then 
Sm(g) = Scss(g) ■ 

Proof. 1. Let /3 be a matrix such that span(/3) = Sy\x- Because Y AL 
(X, Y) , we have 

Y AL (X, Y, fX) => Y AL (X, Y ) | f X. 

The expression on the right-hand side, together with Y AL X\(3 T X, implies 
that X AL Y AL Y\f3 T X, and hence that X AL (Y,Y)\/3 T X. It follows that 
E(X\Y,Y,f3 T X) = E(X\/3 T X), and consequently 

E[Xg(Y, Y)\Y}= E[E(X\(3 T X, Y, Y)g(Y, Y)\Y] 

(12) 

= E[E(X\(3 T X)g(Y,Y)\Y}. 

Thus, Sy\x is a solution space of (11), and assertion 1 follows. 

2. Let r\ be a matrix such that span(r/) =5033(5)- Since (1) holds for 77, 
we have i?(X|r7 r X) = P^(£)X, and so (11) becomes 

£[Xg(y,y)|y] = P^(JA)E[Xg(Y,Y)\Y] = ^P v (^ 1 E[Xg(Y,Y)\Y], 

which implies Ai R (p) = P ri (T,)A m (g)P^ (T,). Hence, Sm,(g) C <Scss(s)- Con- 
versely, let £ be a matrix such that span(£) =Si&(g). Then, 

EpA^E^XgiY^Y] - P i (Y)YT 1 E[{Xg(Y, Y)\Y] f 

= tr[A lR (g)]-tiiA m (g)Pl(Z)] 

-tT[P^)A m (g)}+tv[P^)A m (g)Pl(E)}. 

Because span[j4rR(g)] = span(£) and because Ain(g) is symmetric, the last 
three terms on the right (without sign) all reduce to trL4i R (g)]. Conse- 
quently, the above quantity is 0, implying 

Z- 1 E[(Xg(Y,Y)\Y]=P ( :(X)X- 1 E[(Xg(Y,Y)\Y) a.s. 

Because E(X\£ T X) is linear in X, the right-hand side reduces to 

P^)lA- 1 E[(Xg(Y,Y)\Y] = JA- 1 Pl(^)E[(Xg(Y,Y)\Y] 

= Z- 1 E[E(X\eX)g(Y,Y)\Y}. 

Hence, <Scss(s) = 5 ir(5)- d 
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Let Sols, 5 sir, <Skir, <Spir be the columns spaces of A ls, A S m, ^KIR, 
A m . Let Scss-ols, Scss-SIR, Scss-KIR, Scss-pir be the column spaces of 
Am (g) with g taken to be the four g(Y, Y) functions described in the proof of 
Theorem 3.1. The following corollary follows immediately from Theorem 4.1. 

Corollary 4.1. Suppose all the moments involved in the definitions of 
Sols, ■ ■■ , Spm and Scss-ols, ■ • ■ , Scss-Pm are finite. Then: 

1- <ScsS-OLS — "^YIXi^CSS-SIR Q $Y\X, ^CSS-KIR Q Sy\ x , ScSS-PIR Q S Y \x', 

2. If (1) holds for Sols, ■ • • , <Spir and Scss-ols, ■ • • , <Scss-pir, then 
^ols = ^css-ols, sstr = <scss-sir> 
5rir = 5css-kir, 5pir = 5css-pir- 

Again, note that inclusions in part 1 hold without linearity condition (1). 
Part 2 says that when condition (1) does hold, using Central Solution Space 
based methods will not lose information as compared to the inverse regres- 
sion based methods. 

4.2. Objective functions. We now introduce a population-level objective 
function whose minimizer yields the solution to (11) for each given g. We 
will also describe how it can be estimated based on an i.i.d. sample of (X, Y). 
The next theorem will provide a guiding principle for defining the objective 
function. 

Theorem 4.2. Suppose that Scss(d) has dimension d<p, and let f3 be a 
pxd matrix whose columns form a basis in Scss(d) • Let f(n T X) be a square- 
integrable function such that, whenever span(r]) = span(/3), f(f3 T X) = 
E(X\f3 T X), and whenever span(fy) ^ span(/3), 

(13) P{E[f( V T X)g(Y,Y)\Y}^E[f((3 T X)g(Y,Y)\Y}}>0. 
Let 7/o £ M. pxd be the minimizer of 

(14) L( V ) = E\\E{[X - f( V T X)]g(Y,Y)\Y}f 
overM. pxd . Then, span(?7o) = Scss(<?) • 

Proof. If span(7y) = span(/3), then 
E[f( V T X)g(Y,Y)\Y]=E[E(X\f3 T X)g(Y,Y)\Y} = E(Xg(Y,Y)\Y) a.s. 
Hence, L(rj) = 0. If span(?y) ^ span(/3), then, by assumption (13), 
E\\E{[f{rfX) - f(p T X)]g(Y,Y)\Y}\\ 2 > 0. 
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In the meantime, 

L( V ) = E\\E{[X-f(f3 T X)]g(Y,Y)\Y}\\ 2 

+ E\\E{[f((3 T X)-f( V T X)}g(Y,Y)\Y}\\ 2 
+ 2E(E{[X - f((3 T X)]g(Y,Y)\Y} T 

x E{[f(p T X) - f( V T X)]g(Y,Y)\Y}). 
Because span(/3) =5css(*?) 5 the last term is 0. Therefore, 

L( V ) > E\\E{[f(p T X) - f( V T X)]g(Y,Y)\Y}\\ 2 > 0. 
Hence, the minimizer of L(rj) must satisfy span(ry) =span(/3). □ 

Rather than assuming E(X\j3 T X) to be linear in f3 T X at the outset, as 
we do for classical methods such as SIR, here we model E(X\f3 T X) para- 
metrically. Let fi,---,fk be functions from M. d to R. We will assume that 
E(X\0 I X) lies in the space spanned by fi((3 T X), . . . , /fc(/3 T X). That is, each 
component of E(X\(3 T X) is a linear combination of fi((3 T X), . . . , fk((3 T X). 
Under this assumption, the conditional expectation E{X\[3 T X) can be ex- 
pressed explicitly as 

E(X\P T X) = E[XG T {p T X)]{E[G(I3 T X)G T {I3 T X)\}- l G{[3 T X), 

where 

G(l3 T X) = (f 1 (p T X),...,f k (P T X)) T . 

Note that we are not assuming — and we do not need to assume — that 
E{X\rj I X) is a linear function of fi(rj T X), . . . , fk(r] T X) for every r\ in M pxd . 
All we need is that this holds at the true (3. We use the function 

(15) E[XG T {r ] T X)]{E[G{r 1 T X)G T {r ] T X)]}- 1 G(i 1 T X) 

as the f(rj T X) in the definition (14) of the objective function L(r]). 

We now construct the sample estimate L n (rj) o£L(rj). Suppose that (Xi,Yi), 
. . . , (X n , Y n ) are an i.i.d. sample of (X, Y). For a function r(X, Y), let E n r(X, Y) 
denote the sample average n _1 Ya=i r {^-ii ^)- 

1. Center Y±, . . . , Y n and X±,..., X n as 

Y i = Y l -E n (Y), X i = X i -E n (X). 

2. Select {/i, . . . , fk} that we deem sufficiently flexible to describe the con- 
ditional mean E{X\0 T X) . For example, based on our experience, it often 
suffices to include linear and quadratic functions of f3 T X. In this case, 
the set {/i, . . . , fk} includes the following d(d + 3)/2 + 1 functions 

{l}U{ V fX:i = l,...,d}U{rijX V lX:l<j<k<d}, 

where rji, ... are columns r]. Let 

f(v T X) = E n [XG T (rj T X)]{E n [G( V T X)G T ( V T X)]r l G(rj T X). 
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3. If using OLS, define L n (r}) as 

E n \\(X-f( V T X))Y\\ 2 . 
If using SIR, define L n {rj) as 



-jZE n [I{Y G Jt)]\\E n [(X - f(ri T X))\Y G J e }\\ 2 , 
n l=i 



where 



E n [{X - fir, 1 X))\Y G Ji] = E n [(X - /(rj 1 X))I{Y G J e )]/E n [I(Y G J,)]. 

If using KIR, PIR or the Canonical Correlation estimator, define L n [rj) 
as 



n 



n 



where g is either the function k defined in (6) or the function p de- 
fined in (8). Note that, for PIR and the Canonical Correlation estimator, 
g(Yi,Yj) can be factorized into functions of Yj and Yj, and thus the above 
double sum can be simplified as a single sum. We will come back to this 
in Section 6. 

In the following, we refer to the CSS-based modification of a classical 
estimator as that estimator preceded by the prefix "CSS." For example 
CSS-SIR is the CSS-modification of SIR. 

That the CSS-based methods do not require linearity condition (1) also 
implies that they are no longer restricted to continuous predictors, because 
all we need is that {f\{(3 T X), . . . , fj~(P T X)} be sufficiently flexible to de- 
scribe E(X\f3 T X) whether or not X is continuous. In fact, the application 
in Section 8 shows that CSS-PIR handles a binary predictor effectively. 



5. Parameterization of objective function. In this section, we discuss 
special issues that arise in the minimization of L n {rj). The number 

E\\(X-E(X\p T X))g(Y,Y)\\ 2 

depends on (3 only through its column space and not its specific form. This 
raises the question of how to parameterize the column space parsimoniously. 
The similar problem arises frequently in dimension reduction; for example, 
it arises also in Xia et al. (2002), Cook and Ni (2005) and Cook (2007). 
The most parsimonious parameterization is via the Grassmann manifold, 
whose importance in dimension reduction computation is first noted in 
Cook (2007). 
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Here, we use a more elementary but rather intuitive parameterization — 
we assume that columns of [3 to be a set of d orthonormal vectors and 
parameterize them by the polar coordinate system. First, we represent the 
class of all orthogonal matrices, denoted by O pxp , using the polar coordinate 
system. Note that any matrix in 2x2 can be represented as 



(16) 



cos (a) 
sin(a) 



— sin(a) 
cos(a) 



For an arbitrary dimension p, the space M. p consists of (£) two-dimensional 
orthogonal hyperplanes, and an orthogonal matrix should be able to rotate 
a vector along all of them. Thus, any matrix in (Q) pxp is the product (2) 
orthogonal matrices, each resembling the above matrix. In symbols, for 1 < 



< j < P, let 9ij be an angle in [0, ir] and Bij(6ij) be the matrix in 



constructed by replacing the 



x U 



j) submatrix of the identity matrix 



I p with the matrix of the form (16). That is, 
/l ••• ••• ( 



cos(#. 



I.I I 



8111 



••• sin(#jj) 



Vo 







cos(#jj) 







ith row 



jth row 



1/ 



ith column jth column 



Then, any orthogonal matrix can be represented as 

(17) B= £y(0y). 

l<i<j<p 

We use the first d columns of B as the polar parameterization of i]. 

In this parameterization i] depends on 6, but not all of them. Let us see 
what is the subset of the Oij's that appear in r\. To begin, consider the case 
of p = 5 and d = 2. Then, 

(18) B = (i?i 2 i?i3i?i4.Bi5-B23-B24-B25) (-834-835-845)) 

where the parentheses are added artificially to assist discussion. Note that 
the first 2 columns of B^,B^,B^ are the same as those of I p . Therefore, 
the first 2 columns of -B34-B35-B45 are the same as those of I p . This implies 
that the first 2 columns of B and Byi • ■ • B25 are the same. In other words, 
-634-B35-B45 can be ignored without changing the first 2 columns of B. In 
general, ij depends only on the following Oifs: 



(19) 



{Oij-.l <i< d,i < j <p} = 
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There are pd — d(d + l)/2 = m parameters in this set. We write the param- 
eterized r\ as f]{8). That is, rj(8) comprises the first d columns of 

n B ij(8ij)- 

l<i<d,l<j<p,i<j 

Using this parameterization, we minimize L n (r](8)) over 8. Because 8^ 
and 9ij ± ir give the same direction, we maximize 8 over the set [0,7r] m . 
For the initial value of 8, we recommend using the corresponding classical 
methods such as OLS, SIR, KIR and PIR, or the Outer Product Gradient 
estimator (OPG) by Xia et al. (2002). Many softwares are available for 
minimizing functions like L n (rj(0)). For example, the OPTIM function in R 
works well for our purpose. All it requires is a subroutine that evaluates the 
objective function and an initial value of 8. We use span{7/(#)}, where 8 is 
the minimizer of L n (rj(8)) as the estimator of <ScssQ/)- 

We should note that the polar coordinate system is not the most par- 
simonious parameterization, in the sense that span(#) does not uniquely 
determine 6, even though 8 has much lower dimension of r\. Numerically, 
this causes no difficulty with an appropriately chosen initial value such as 
described above. However, this does mean that the objective function has 
a singular Hessian matrix, and this must be taken into account when we 
derive the asymptotic distribution of 8, as we do in the next section. 

It is possible for L n (rj(6)) to possess multiple minimizers, and we do oc- 
casionally run into this problem. However, this is mitigated by the judicious 
choice of an initial value. For example, OPG, which is easy to compute, 
seems to work very well. Furthermore, our experience indicates that as long 
as L n (r/(8)) is decreased (from the initial value) the performance of the CSS 
estimators tends to be enhanced regardless of convergence of the algorithm. 
Thus, if we use a robust minimization algorithm that guarantees to de- 
crease the objective function at each step, the issue of local minima should 
not cause serious concern. For example, the Nelder-Mead simplex algorithm 
[Nelder and Mead (1965)], as implemented in OPTIM mentioned previously, 
is robust in this sense. 

6. Asymptotic distribution. We now derive the asymptotic distributions 
of 8. Because of limited space, in this paper we will only tackle the asymp- 
totic analysis of CSS-PIR, which includes CSS-OLS as a special case. To fur- 
ther simplify computation, we only consider the case where hi(y), . . . , h s (y) 
are monomials of y and fi(rj T X), . . . , f)-{ri T X) are monomials of fjjX. Fur- 
thermore, we require that both sets of functions must contain the function 
that is constantly 1. Under this assumption, there is no need to assume 
E{X) = and E(Y) = (i.e., no need to center X as X — EX and Y as 
Y — EY) because, for example, 

{1,1/,..., I/ 8 " 1 } and {l,y-EY,...,(y-EYf- 1 } 
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span the same functional space. The development of the general case is par- 
allel to this simplified case but will have much more terms in the asymptotic 
expansion, complicating an otherwise transparent argument. Note that this 
restriction does not apply to estimation, where centering causes no addi- 
tional complication. 

For bookkeeping, we first give a one-to-one correspondence between the 
double index in (19) and a single index. Let J = :j = i + l,...,p,i = 

1, . . . , d}. For each G J, let 

t = t(i,j) =p(i - 1) - (i - l)i/2 + (j - i). 

Conversely, for each t G {1, . . . , m}, let 

i(t) = max{i : p(i - 1) - (i - l)i/2<t}, 

m = t - \p(i(t) - 1) - m - i)*(t)/2] + i(t). 

In this arrangement, as the double index runs through J with j chang- 
ing first, the single index t runs through 1 to m and vice versa. Let 

0t(ij) = = ^ij- 

Let = (0i, . . . , <p m ) T ■ The 77 can be equivalently parameterized by <fi as 

m 

Denote the range of 4>, [0,vr] m by fi<£. 

Let J be a convex class of distributions of (X, Y") , which contains the 
true distribution Fq and all empirical distributions. Let Ep(-) denote the 
expectation under F and E(-) denote the expectation under Fq. We can 
reexpress L(ji((j))) and L n (r]((j))) defined in Section 4.2 as evaluations of a 
mapping from x T to R, evaluated at the true distribution Fo and the 
empirical distribution F n , respectively. Let 

£(4>,F) = tv{E F [(X - /(, ? T (0)X))tf T (Y)] 

(20) 

x [E F {H{Y)H T {Y))Y l E F [H{Y){X - ftfW)X)) T \}. 

In this notation, L(rj((p)) and L n (j](<p)) becomes £(4>,Fq) and £(4>,F n ). 

As we have noted in Section 5, 4> is not uniquely determined by the sub- 
space span(r/(0)). Our asymptotic result reflects this fact by allowing the 
Hessian matrix of £((J),Fq) to be singular. Let 



g(<h,F) 



d£(tf,,F) 



W = W{cp ,F ) 

4>=<Po 



Fn 



<p=(f>0 
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Let Pw be the projection on to the column space of W, and let Qw = 
I m — Pw- By Taylor expansion, it is easy to see that 

£{00 + n- l l 2 5, F ) = n- l 5 T W5 + o{n~ l ), 

£{0o + n~ l l 2 P w 5, F ) = n~ 1 5 T W5 + o(n~ l ). 

That is, in a contiguity neighborhood of (fio, £(-,Fq) is unaffected by the 
component Qy/$ of the parameter. In other words, locally at <f>o, it is Pw$ 
that parameterizes the subspace span(r/(^o + n _1 / 2 Piy<5)), and the compo- 
nent Qw$ has no effect on this subspace. Similarly, at the sample level, it 
can be shown that (not presented here) 

£(l F n ) = £(<p + P W (4> - <p ),F n ) + Opin" 1 ), £(j>, F n ) = O p (n~ l ). 

Thus, Qw(<P~ 00 ) has n o effect on the sample objective function £(-,F n ). For 
this reason, the asymptotic distribution of relevance is that of ^JuPyjify — 
(j>o), rather than that of the full parameter y/n(cp — cfto). The next theorem 
gives the asymptotic expansion of y/nPw(<f> — 4>o)- Its proof will be given in 
the Appendix. 

Theorem 6.1. Suppose that the regularity conditions described in Sec- 
tion A.l are satisfied. Let be the Moore-Penrose inverse of W, and 
let g*(X,Y,(j>o,Fo) be the influence function of the mapping F i— ► g((po, F) 
evaluated at Fq. Then, 

(21) P w (4>- O ) = -W^E n g*(X,Y,<j> Q ,F Q ) + o p {rT 1 ' 2 ). 

The explicit expression for W is given by (32) through (35), and that for 
g*(X,Y,(fio,Fo) is given by (36) through (38) in the Appendix. 

From expansion (21), we can easily derive the asymptotic distributions of 

VnPw(4>-4>o)- 

Corollary 6.1. Under regularity conditions described in Section A.l, 

v^iW-0o)^^(O,A(0 o ,F o )), 
where A(0 O , F ) = W^E{g*(X, Y, O , F )[g*(X, Y, O , F )f}Wl 

In practice, we can estimate A(<fio,Fo) by replacing W with its sample 
estimate W(4>,F n ) and replacing E{g*(X,Y, (j) ,F )[g*(X, Y, O , F )] T } with 

n 

n- 1 (Xi,Yi, 4>, F n )[g* {X h Y h F n )f}. 

i=i 



x[,t 



*." T- •i. 



J 

"HUT* 
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Fig. 1. Scatter plot matrix for the ^-dimensional, nonelliptically-distributed predictor X . 



7. Simulation comparisons. In this section, we compare the CSS-based 
methods with their classical counterparts as well as two adaptive estima- 
tors when the predictor X has a nonelliptical distribution. We consider the 
following three models: 



Model I 
Model II 
Model III 



Y = e x * + (X 4 + 1.5) 2 + e, 

Y = 0.4Xf + 3sin(Xt/4) + 0.5e, 

Y = X 3 /[0.5 + (X 4 + 1.5) 2 ] + O.le, 



where e ~ N(0, 1) and e JL X. We first take the sample size to be n = 100. 
The dimensions of X are chosen to be p = 4,6,8. Note that, in all three 
models, d = 2 and Sy\x is spanned by (0, 0, 1, ... , 0) T and (0, 0, 0, 1, ... , 0) T . 

We introduce nonlinearity in the predictor as follows: X\ ~ N(Q, 1), X<i ~ 
#(0,1), 

X 3 = 0.2Xi + 0.2(X 2 + 2) 2 + 0.25, 

(22) 

X 4 = 0.1 + O.lpfi + X 2 ) + 0.3(Xi + 1.5) 2 + 0.25, 
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where 5 1L (X, Y) and 5 ~ iV(0, 1). When p = 6, 8, X§ through Xg are taken 
to be independent of iV(0, 1) and to be independent of {X\, . . . , X±). Figure 1 
shows the scatter plot matrix of X\, . . . , X4. Predictors of this type are very 
common in practice. 

We apply three methods based on Central Solution Space CSS-SIR, CSS- 
PIR and CSS-KIR, as well as their classical counterparts, SIR, PIR and 
KIR, to the three models. Because CSS-OLS and OLS can only estimate 
one-dimensional Central Spaces (d= 1), we do not include them in the com- 
parison. We also compare with OPG and the Minimum Averaged Variance 
Estimator (MAVE) introduced by Xia et al. (2002). The simulation sample 
size is N = 200. For SIR and CSS-SIR, the number of slices is taken to 
be 10, with each slice having equal number of observations. For PIR and 
CSS-PIR, the function H{Y) is 

H(Y) = (l,Y,Y 2 ). 

For all three CSS methods, the function G(rj T X) is taken to be 

G{ifX) = (1, r,JX, rgX, {^X)\ . . . , (^Xf). 

For the KIR and CSS-KIR, the function in (6) is taken to be the standard 
normal density, and the bandwidth h in (6) is taken to be 0.4. The kernel 
function for OPG and MAVE is taken to be the normal density, with stan- 
dard deviation (kernel width) taken to be 0.7 for p = 4,6 and 0.8 for p = 8. 
These parameters perform reasonably well in several pilot trial runs. 

To assess the accuracy of each method, we use the squared multiple cor- 
relation coefficient. Specifically, suppose U and V are d dimensional random 
vectors, and S[/y, Ejy and Tiy are the covariance matrix between U and 
V, the covariance matrix of U and the covariance matrix of V, respectively. 
Then the square multiple correlation coefficient is defined by 

(23) p 2 = trp^Sc/yS^SvT/S- 1 / 2 ] = trp-^EvT/X^Sc/v-S- 1 / 2 ]. 

See Hall and Mathiason (1990). The measure takes maximum value d if U 
and V have a linear relation and takes minimum if the components of U 
and V are uncorrelated. At the sample level, given an estimator (3 of /3, we 
use the sample version of the above measure based on 

T X u ...J T X n } and {fX 1 ,...,fX n }. 

Note that the larger value of this criterion corresponds to a better dimension 
reduction estimate. 

We compute the errors of estimation by the eight methods, for three 
models and three choices of p and across the 200 simulated samples. The 
results are presented in Table 1. 

Each entry of Table 1 is formatted as a(b), where a is the average of the 
above criterion across the 200 simulated samples and b is the standard error 
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of the average. From the table we see that the CSS-based methods are sub- 
stantially more accurate than their classical counterparts across all 9 cases, 
indicating that there is much to be gained by correcting the bias caused by 
nonellipticity. OPG and MAVE perform competently under nonellipticity, 
but on the whole their improvements are not as sharp as the CSS-based 
methods. In particular, the accuracy of CSS-KIR dominates that of OPG 
and MAVE in all 9 cases by substantial margins (relative to the standard 
deviations). CSS-PIR also performs better than OPG and MAVE in most 
(8 out of 9) cases. The performance of CSS-SIR is somewhat similar to OPG 
and MAVE. This is partly due to the fact that slicing is somewhat ineffi- 
cient, because the inter-slice information is not used — an aspect that cannot 
be improved by the CSS correction. The loss of intra-slice information by 
SIR is noticed by Cook and Ni (2006), who proposed a method to reduce it. 

For larger sample sizes, the performances of all estimators improve, and 
MAVE and the CSS-based methods become more similar. Table 2 compares 
CSS-KIR with KIR, OPG and MAVE for p = 6 and n = 200, 300, 400, 500. 



Table 1 

Comparison of CSS and classical estimators 



Model 


Method 


P 


= 4 


P 


= 6 


P 


= 8 


I 


PIR 


1.366 


(0.017) 


1.336 


(0.017) 


1.264 


(0.015) 




CSS-PIR 


1.658 


(0.021) 


1.631 


(0.017) 


1.393 


(0.017) 




SIR 


1.112 


(0.013) 


1.100 


(0.011) 


1.064 


(0.007) 




CSS-SIR 


1.735 


(0.018) 


1.423 


(0.020) 


1.293 


(0.019) 




KIR 


1.701 


(0.014) 


1.661 


(0.015) 


1.618 


(0.015) 




CSS-KIR 


1.832 


(0.010) 


1.711 


(0.014) 


1.637 


(0.017) 




OPG 


1.581 


(0.023) 


1.377 


(0.020) 


1.282 


(0.016) 




MAVE 


1.785 


(0.016) 


1.602 


(0.018) 


1.382 


(0.017) 


II 


PIR 


1.400 


(0.015) 


1.346 


(0.015) 


1.349 


(0.013) 




CSS-PIR 


1.755 


(0.018) 


1.558 


(0.021) 


1.476 


(0.021) 




SIR 


1.302 


(0.022) 


1.256 


(0.017) 


1.208 


(0.017) 




CSS-SIR 


1.789 


(0.013) 


1.439 


(0.021) 


1.333 


(0.021) 




KIR 


1.514 


(0.018) 


1.468 


(0.016) 


1.437 


(0.015) 




CSS-KIR 


1.794 


(0.015) 


1.551 


(0.022) 


1.480 


(0.020) 




OPG 


1.604 


(0.023) 


1.406 


(0.023) 


1.302 


(0.020) 




MAVE 


1.622 


(0.022) 


1.397 


(0.021) 


1.265 


(0.018) 


III 


PIR 


1.149 


(0.014) 


1.115 


(0.011) 


1.065 


(0.009) 




CSS-PIR 


1.839 


(0.014) 


1.694 


(0.018) 


1.557 


(0.020) 




SIR 


1.265 


(0.020) 


1.171 


(0.014) 


1.116 


(0.013) 




CSS-SIR 


1.833 


(0.008) 


1.552 


(0.020) 


1.454 


(0.019) 




KIR 


1.146 


(0.014) 


1.113 


(0.011) 


1.063 


(0.009) 




CSS-KIR 


1.862 


(0.013) 


1.705 


(0.019) 


1.613 


(0.019) 




OPG 


1.742 


(0.017) 


1.584 


(0.022) 


1.453 


(0.020) 




MAVE 


1.803 


(0.016) 


1.584 


(0.021) 


1.375 


(0.019) 
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Table 2 

Comparison KIR, CSS-KIR, OPG and M AVE for larger n's 



Method 


n = 200 


n = 300 


n = 400 


n = 500 


KIR 


1.704 (0.011) 


1.725 (0.010) 


1.797 (0.005) 


1.781 (0.005) 


CSS-KIR 


1.816 (0.009) 


1.846 (0.005) 


1.854 (0.004) 


1.861 (0.004) 


OPG 


1.506 (0.023) 


1.614 (0.022) 


1.681 (0.020) 


1.730 (0.021) 


MAVE 


1.824 (0.014) 


1.885 (0.012) 


1.847 (0.013) 


1.922 (0.009) 



The kernel width (of X) for OPG and MAVE are taken to be 0.6, 0.5, 0.4, 0.4, 
and the kernel width (of Y) for KIR and CSS-KIR are 0.3,0.2,0.1,0.1. The 
basis functions in H{y) now include third polynomials, and the basis func- 
tions in G{r] T X) include fourth polynomials. We see that, while OPG and 
KIR still trail behind CSS-KIR, MAVE catches up with CSS-KIR at around 
n = 400 and surpasses it at n = 500. This is because, as we can see from (22), 
the dependence of X\ and X2 on V3 and X4 involves the square root func- 
tion, and as a consequence E(X\X^, X4) does not belong to the polynomials 
of rj T X. We have also performed simulation comparisons parallel to those 
presented in Tables 1 and 2 with Y depending on X\ and X2 instead of X3 
and X4, in which case E(X\r/ T X) does belong to the polynomial family. In 
this comparison the advantage of the CSS-based methods is more striking, 
and, for larger n, MAVE no longer has the mentioned advantage. These 
results are not presented for the lack of space. 

8. Application. We consider data collected for Massachusetts four-year 
colleges in 1995, which are attempted to study how the percentage of fresh- 
men that graduate (Grad) depends on variables measuring quality of incom- 
ing students and features of the colleges. The data is provided as an example 
data set in MINITAB (release 15, data directory STUDNT12). We restricted 
attention to n = 46 colleges and p = 8 predictors, which are: the percentage 
of freshmen that were among the top 25% percent in their graduating high 
school class (Top25), the median mathematics SAT score (MS AT), the me- 
dian verbal SAT score (VSAT), the percentage of applicants accepted by the 
college (Accept), the percentage of accepted applicants who enroll (Enroll), 
the student-to- faculty ratio (SFRatio), the out-of-state tuition (Tuition) and 
whether the college is public or private (PubPriv) . Since PIR does not apply 
to binary data, we first compare PIR and CSS-PIR ignoring the PubPriv 
variable, and then incorporate PubPriv in the CSS-PIR analysis to see how 
the latter handles a binary predictor. 

The scatter-plot matrix in Figure 2 reveals nonlinearity among predictors — 
for example, in the relations between Top25 and Accept, Accept and Tuition, 
VSAT and tuition. The upper panels of Figure 3 present the scatter plots 
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of Y (Grad) versus the first predictors obtained from PIR (left panel) and 
CSS-PIR (right panel). 

Since the true model is unknown, we can no longer use criterion (23) 
to compare the performances of PIR and CSS-PIR. We will use instead a 
leave-one-out cross validation criterion to compare their performances [see, 
e.g., Allen (1974), Stone (1974)]. Let /L fe and /§_& be the estimated by 
PIR and CSS-PIR when (X^, Y k ) is deleted from the sample. From Figure 3, 
we see that the scatter plots are roughly linear. So, for each of the 46 leave- 
one-out samples, we fit linear models using both the PIR and the CSS-PIR 
predictors and predict the deleted Yk by pL k X k and 0L k X k using their 
respective linear models. The sums of squared prediction errors over the 46 



400 600 20 60 10 20 



top25 


t 


* * 

• 

1 


* ■ 

■ 


>.« • * 
.1 . ■ 

* 


-v « 
* . 

■V ■ r 

- 'x. ■ 

J 


- *i 

* 


■ 

» ■ * 


MSAT 






• 

1 

*<; . 

. ■ • 


1 




.■ 

. '.'» 


•*? 

i + 


VSAT 




V** 

'if' 


■ * ■ 




*\ ' , * ^ 




■*»*.'■ - ■ 
• * 


accept 


■.& T " 
j. - ■ ■ 
•* 




■'• »'. 


■ 1 ■ ■ 


, q * 

• - - \~ 




■ 

" 

3 . " 


enroll 






. ,'•'•;-";•> 




*#» ' . 
■* 

t*~ 

. ♦ * 


• * \ 
'•*'. 




SFratio 


«:*•*. 


■ ■'. • 

■*1 f 


J * ■"■ 
** * 
*» - 






* i * 
•* * * 


1U 


tuition 



20 60 100 400 550 20 40 60 5000 15000 



Fig. 2. Scatter-plot matrix for the seven continuous predictors of the Massachusetts col- 
lege data. 
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samples for PIR and CSS-PIR are, respectively, 6145 and 5203, indicating 
a respectable improvement by CSS-PIR. 

We now incorporate PubPriv and repeat the CSS-PIR analysis. The 9 
public schools are indicated by 1 and the 37 private schools are indicated by 
0. The lower panel in Figure 3 is the scatter plot of Y versus the first CSS- 
PIR predictor after incorporating PubPriv. The cross validation criterion 
is now further reduced to 4345, another appreciable drop, indicating that 
CSS-PIR handles the binary predictor effectively. 



APPENDIX: ASYMPTOTIC ANALYSIS 

A.l. Regularity conditions, notation and preliminaries. The estimator (ft 
in Theorem 6.1 is a function of the empirical distribution F n of (X\,Yi), . . . , 
(X n ,Y n ). That is, it has the form A(F n ), where A is a vector. Let Fq be the 
true distribution of (X, Y) and, for any a E [0, 1], F n ^ a = (1 — o)Fq + aF n . 
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Then, under regularity conditions, 

(24) A{F n ) - A(F ) = [dA(F n , a )/da] a=0 + o^n 1 ' 2 ). 

See von Mises (1947), Fernholz (1983) and McCullagh (1987). When an es- 
timator satisfies (24), it is called an asymptotically linear estimator [Bickel 
et al. (1993)]. A wide class of estimators fall into this category. In the fol- 
lowing proof, we will assume at the outset that expansion (24) holds for 
Theorem 6.1. This means that we will omit the proof of consistency and 
smoothness of the statistical functional A{-). General conditions for estima- 
tors defined by minimization of objective functions can be found in van der 
Vaart (1998), Chapter 5. 

The underlying sufficient condition for expansion (24) is that the map- 
ping F \— > A{F) is Frechet differentiable at Fq (or more generally Hadamard 
differentiable) , with respect to the || • ||oo in a convex family of distributions 
J- that contains Fq and all empirical distributions. This is not a strong as- 
sumption. All estimators discussed in this paper are either themselves func- 
tions of sample moments or solutions to equations constructed from sample 
moments. For example, the key components of SIR and CSS-SIR are the 
sample conditional moment E n \XI(Y £ «/&)]/ 'E n [I(Y G </&)], which is a ratio 
of sample moments. Statistics of this form are typically Frechet differentiable 
under mild conditions. See, for example, Fernholz (1983), Chapter 2. 

In our context, the Frechet derivative of A(-) at Fq can always be repre- 
sented as the linear mapping 

(25) F^E F A*(X,Y,F ), 

where A*(X, Y, Fq) satisfies E Fo A*(X, Y, Fq) = 0, with its elements belonging 
to L2(Fo). When it causes no ambiguity, we will abbreviate A*(X, Y, Fq) by 
A*(Fq). Because of the one-to-one correspondence between the random ele- 
ment A*(Fq) and mapping (25), we will refer to A*(Fq) itself as the Frechet 
derivative. Moreover, the Frechet derivative, when it exists, coincides with 
Gateaux derivative, defined as the mapping 

F^[D a A((l-a)Fo + aF)] a=0 . 

Hence 

(26) [D a A({l - a)F + aF)] a=0 = E F A*(F ), 
and consequently the expansion (24) can be rewritten as 

(27) A(F n ) = A{Fq) + E n A*(F ) + o^n" 1 / 2 ). 

The next lemma provides some basic formulas for Frechet derivatives, 
defined as the random element A*(Fq). Let p: T — > G C M s be a Frechet 
differentiable mapping and r(p(F),F) be a real, vector or matrix- valued 
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Frechet differentiable function on T . Let po = p(Fq)- The symbol t*(po,Fq) 
denotes the Frechet derivative of the mapping F — ► r(po,F) at Fq; whereas 
t*(Fq, pq(Fq)) denotes the Frechet derivative of the mapping F — > r{p{F),F) 
at F . 

The subsequent expansions demand an efficient notation system for differ- 
entiation. We will frequently encounter mappings of the form Q(p(F a ), F a ), 
where F a = (1 — 0)^0 + otF for some distribution F £ T and some vector- 
valued function p(-) defined on T. We use D a Q(p(F a ), F a ) to denote the 
(total) derivative dQ{p(F a )F a ) / da and d a Q(p, F a ) the partial derivative 
dQ(p, F a )/da. We use d Pi Q(p, F a ) to denote dQ(p, F a )/dpi, where pi is the 
i component of p, and use d p to denote the vector of differential operators 
(d pi , . . . , d Pm ) T . We use d p to denote the matrix d p dj . For a single opera- 
tor such as d Pi , both d Pi Q and Qd Pi are to be understood as the derivative 
dQ/dpi. This is so that differential operation behaves, to a degree, like ma- 
trix multiplication. For example, if q is a vector-valued function of p, then 
qdj represents the matrix 

(qd pi ,...,qd pm ) = (d pi q,...,d Pm q) 

and d p q T denotes the transpose of the above matrix. This notation is help- 
ful in tracking the dimensions of derivative arrays and the ways in which 
derivatives are arranged in an array. Note, however, that the associative law 
does not apply: {q\d T )q2 / qi(d T q2). For this reason, we will always use 
parentheses to associate a differential array with the function on which it 
operates. 

Lemma A.f . The following relations hold: 

1. If the mappings F 1— ► p{F) and F 1— ► r(p(F),F) are Frechet differentiable 
with respect to F at Fq, then 

s 

(28) r*(p(F ), F Q ) = $> Pi r(p, F )] p=po p*(F ) + r*(p , F ); 

i=l 

2. If p(F) is a linear functional of F, that is, if p(F) = Ep\r{X, Y)\ for 
some square-integrable (and vector-valued) function r of (X,Y) that does 
not depend on F, then 

p*(F ) = r(X,Y)-E[r(X,Y)}. 

Proof. Part 2 is well known; [see Fernholz (1983), page 8]. By differ- 
entiation, 

s 

[D a T(p(F a ),F a )] a=0 = Y^[d Pi T(p,Fo)} p=po [D a pi(F a )] a=0 + d a T(p , F ). 
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By (26), [D aPi (F a )] a=Q = E F p*(F ) and [d a r(p , F a )] a=0 = E f t*{ Po , F ). 
Hence, 



[D a T(p(F a ),F a )] a=0 = E F 



J2[d P MP,Fo)]p=PoPKFo)+r*( Po ,F 



Li=l 



By (26) again, the expression inside the brackets on the right-hand side is 
r*(p(F ),F ). □ 

A.2. Proof of Theorem 6.1. Let 

R 1 (F) = E F [XH T {Y)}, 
R 2 ^,F) = E F [XG T ( V T ( ( f>)X)}, 

(29) R 3 (<t>,F) = E F [G{ri r {4>)X)G T {r ] T { ( j ) )X)), 
R 4 (^F) = E F [G(r] T (4>)X)H T (Y)], 

R 5 (F) = E F [H(Y)H T (Y)]. 

Note that only R2(4>,F), R^{cj),F) and R±(<I>,F) depends on <f>. Let 

R((f>, F) = R 1 (F) - R 2 {4>, F)R^ 1 (4>, F)R^, F). 

Then, by the definition (15) of f{r] T X), the £((/), F) in (20) can be reexpressed 
as 

£(0,F)=tT{[R 1 (F)-R 2 (<j>,F)Rz 1 ( ( f>,F)R A ( ( t>,F)}Rs 1 (F) 

(30) x [Ri(F) - R2{4>, F)R^ 1 (4>, i ? )i? 4 ((/>, F)] T } 
= tr[R( ( f > ,F)R^(F)R T (ct>,F)}. 

Let 4>{F) be the minimizer of £(4>,F). In this notation, <j) and 4>o defined 
in Section 6 are expressed as <P(Fq) and 4>(F n ), respectively. Recalling that 

g((f),F) = d <t> £(^,F), we have 

g(<l>(F),F) = 

for all F £ T . Take Frechet derivative on both sides, using (28) to obtain 

Wcj ) *(F )+g*(cf> ,F ) = 0. 

Here, g*(4>o, Fq) is to be understood as the Frechet derivative of F — > g(4>o,F) 
at Fq [recall that g*((f>o,Fo) is the abbreviation of g*(X, Y, 4>q, Fq)\. Multiply 
both sides of the above equality by W', and use the fact W*W = P w to 
obtain 

(31) Pw^(Fo) = -W^g*^ ,Fo), 
which, by (27), implies (21). 



26 



B. LI AND Y. DONG 



It remains to compute the m x m nonrandom matrix W and the Tri- 
dimensional random vector g*(<fio,Fo). Because R = R\ — R2R 3 1 Ra = at 
(4>o,Fq), and because R§ does not depend on eft, the (t,u)th element of 
a%£(<j>o,F ) is 

(32) W tu = 2tr[(d^R)R^(d^Rf]. 

Here and below, symbols such as d§ t R and i?5 abbreviate functions d^R^, F) 
and R$(F) evaluated at ((J)o,Fq). Because Ri(F) does not depend on <fi, 

(33) dfoR = —(d c f )t R2)R 3 1 Ra + R2R3 1 (d ( j )t R3)R 3 l Ri — R2R3 1 {d < j >t Ri). 
Let r)^ denote the p x d derivative matrix drj/dcfrt evaluated at <j)Q. Then, 

d^ t R 2 = E[(X - EX){X - EX) T ^ t G T ], 

(34) d 4t R 3 = E(Gr$ t (X - EX)G T )+E(G(X - EX) T ^ t G T ), 

a^IU = E{Gi^{X-EX)H T ). 
The derivative 7/<^(</>) can be conveniently computed as follows: 

(35) = • • • S t _i(0 t _i)[^ t ^(^)]fi t+1 (</) t+1 ) • • • £ m (0 m ), 

where d ( f >t B t ((j)t) is a p x p matrix whose (i(t),j(t)) x (i(t),j(t)) submatrix 
is 

/-sin(^ t ) -cos(<p t )\ 
V cos(^j) -sm((j> t )) 

and whose other elements are all 0. 

We now derive g*(4>o, Fq). Differentiating £(<f>,F) with respect to <pt, and 
evaluating it at 4>q, we have 

g t (cPo,F) = 2tr{[d^R(<Po,F)}R^(F)R T ( ( f> ,F)}. 

Hence, the ith component of g*((fto,Fo) is 

(36) 9*t(<Po,F ) = 2tr[(a^ J R)i? 5 - 1 ( J R*) T ]. 

By Frechet differentiation of F — > R((pQ,F), evaluated at Fq, we have 

/Q7"\ p* p* p* p 1 p 1 p* p— 1 p* p~l p p p — 1 p* 

\ol ) tt — JX^ — rt2-fl3 -fl4 -f- XI2-1I3 .ri3-rt3 -ft4 — 1X2-1X3 tt±. 

Here, symbols such as R2 denote the Frechet derivative of the mapping 
F 1 ^ R2(4>o,F) evaluated at (fo- Using part 2 of Lemma A.l, we deduce that 

R\ = XH T (Y) - E[XH T (Y)\, 

Rl = XG T (n T (ct> )X) - E[XG T (rj T (<p )X)], 
(38) R* 3 = G( V T ( ( l ) o)X)G T (r ] T (cj )0 )X) - E[G( V T ' (fa)X)G r \ V T '(<f>o)X)], 

Rl = G(r) T (<i> )X)H T (Y) - EiG^i^X^iY)], 

R* 5 = H{Y)H T {Y) - E[H(Y)H T {Y)}. 
This completes the proof. 
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